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Abstract 

Background: Enumeration of all theoretically possible amino acid compositions is an important problem in several 
proteomics workflows, including peptide mass fingerprinting, mass defect labeling, mass defect filtering, and de 
novo peptide sequencing. Because of the high computational complexity of this task, reported methods for 
peptide enumeration were restricted to cover limited mass ranges (below 2 kDa). In addition, implementation 
details of these methods as well as their computational performance have not been provided. The increasing 
availability of parallel (multi-core) computers in all fields of research makes the development of parallel methods 
for peptide enumeration a timely topic. 

Results: We describe a parallel method for enumerating all amino acid compositions up to a given length. We 
present recursive procedures which are at the core of the method, and show that a single task of enumeration of 
all peptide compositions can be divided into smaller subtasks that can be executed in parallel. The computational 
complexity of the subtasks is compared with the computational complexity of the whole task. Pseudocodes of 
processes (a master and workers) that are used to execute the enumerating procedure in parallel are given. We 
present computational times for our method executed on a computer cluster with 12 Intel Xeon X5650 CPUs (72 
cores) running Windows HPC Server. Our method has been implemented as a 32- and 64-bit Windows application 
using Microsoft Visual C++ and the Message Passing Interface. It is available for download at https://ispace.utmb. 
edu/users/rgsadygo/Proteomics/ParallelMethod. 

Conclusion: We describe implementation of a parallel method for generating mass distributions of all theoretically 
possible amino acid compositions. 



Background 

Mass spectrometry (MS) plays a crucial role in modern 
proteomics as a key method for protein identification 
and quantification. MS provides accurate mass and 
abundance measurements of intact and fragmented pep- 
tide ions, which are then processed by specialized algo- 
rithms and transformed into peptide and protein 
identities. Thus, efficiency of many MS-based proteo- 
mics workflows depends on how well we understand - 
and can utilize - the properties of peptide masses and 
peptide mass distribution. 
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It has been observed that peptide masses have a nonu- 
niform, clustered distribution, which is explained by the 
fact that peptides are made of twenty amino acids with 
specific masses. This distribution consists of repeating 
peaks separated by approximately 1 Da, which become 
taller and wider as the mass increases. Consecutive 
peaks are separated by low populated regions (quiet 
zones) and gaps (forbidden zones)-that is, the mass 
ranges for which there exist no possible sequences of 
amino acids. Nonuniformity (peaks, gaps) and discrete 
nature of the mass distribution of peptides are impor- 
tant for two major problems in MS-based proteomics: 
peptide identification and de novo sequencing. 

The knowledge of the mass distribution of a particular 
type of peptide (for example, non-modified tryptic pep- 
tides) can be used to facilitate peptide identification in a 
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number of ways. Forbidden zones allow us to filter out 
MS signals corresponding to non-target species (non- 
peptide contaminants or modified peptides) early on, 
before doing any complicated processing of MS data. 
Dodds and coworkers [1] showed that this results in 
exponential improvements in statistical significance and 
discrimination of protein identification based on peptide 
mass fingerprinting on the Mascot platform. Nonover- 
lapping or partially overlapping peaks in the mass distri- 
butions of different types of peptides allow recognition 
of these types based solely on precursor masses. For 
example, Spengler and Hester [2] showed that accurate 
masses (with accuracy of 0.1 or even 1 ppm) allow phos- 
phorylated and nonmodified peptides to be distin- 
guished. Lehmann and coworkers [3] and Jones and 
coworkers [4] showed that this is possible for glycopep- 
tides and lipids. In addition, there have been many sug- 
gestions for label tags shifting the mass of labeled 
peptides to quiet or forbidden zones in order to allow 
easier identification and quantification of these peptides 
[5]. 

The major drawback of peptide identification algo- 
rithms based on database search is their inability to 
identify peptides that are not present in the reference 
database. De novo sequencing algorithms are designed 
to restore peptide compositions from MS data without 
the use of peptide databases. These algorithms employ 
several strategies for MS data analysis s [6], one of 
which is based on the fact that for a given mass there 
exist only a finite (though sometimes very large) number 
of amino acid sequences (or amino acid compositions) 
that can assume that mass, and that these sequences 
(compositions) can be explicitly enumerated. The use of 
the masses of fragment ions can further reduce the 
number of admissible compositions. Several reports 
have shown the feasibility of this strategy, especially for 
high accuracy data provided by modern Fourier trans- 
form mass spectrometers [7-9]. 

Proteomics applications mentioned above rely on spe- 
cific properties of the peptide mass distributions that 
can only be obtained by enumerating all theoretically 
possible peptides. Moreover, in many circumstances it is 
impossible to generate these distributions once and for 
all, as many parameters can vary from experiment to 
experiment (peptide modifications, enzymatic specificity, 
number of missed cleavages, etc.) Thus, it is desirable to 
be able to generate peptide mass distributions (or some 
parts of these distributions) "to order" and, therefore, to 
be able to generate them fast. 

Several works focusing on different MS-based proteo- 
mics applications employed enumeration of all theoreti- 
cally possible peptides [8,10-13]. Because of the high 
computational complexity of the task, enumeration of 
peptides was done for the mass range below 2 kDa, 



which limited applicability of the obtained results. Also, 
even for this mass range long computational times and 
extensive computational capabilities were often required. 
Olson and others [8] mentioned the use of a parallel 
method for peptide enumeration, but details of its 
implementation as well as its computational perfor- 
mance were not reported. 

In a recent paper [14] we described the mass distribu- 
tion of all theoretically possibly tryptic peptides made of 
20 amino acids, up to the mass of 3 kDa. The paper 
provided detailed characterization of forbidden zones 
and amino acid compositions of peptides from the quiet 
zones. We showed how forbidden zones shrink over the 
mass range, where they completely disappear and how 
they depend on the measured mass accuracy. We found 
that peptide sequence compositions in the quiet zones 
are less diverse than those in the peaks of the distribu- 
tion, and that forbidden zones may be extended by elim- 
inating certain types of unrealistic compositions. We 
also characterized symmetry of mass peaks and the 
accuracy of the Mann's equations [13] for the mass peak 
position and width. Our study was made possible by 
advancing computational techniques for the enumera- 
tion of amino acid compositions. 

In this paper, we describe in detail a parallel method 
for enumerating all amino acid compositions up to a 
given length. First, we present a pseudocode for recur- 
sive procedures which are the core of this method. We 
then show how a single task of enumerating all peptide 
compositions can be divided into smaller subtasks that 
can be executed in parallel. We also show how the com- 
putational complexity of these subtasks compares with 
the computational complexity of the primary task. 
Finally, we provide pseudocode of processes (a master 
and workers) that are used to execute the enumerating 
procedure in parallel. To the best of our knowledge this 
is the first description of a computational method for a 
complete and unbiased enumeration of all theoretically 
possible peptides. We present computational times for 
our method, implemented by using Microsoft Visual C+ 
+ and the Message Passing Interface (MPI), and exe- 
cuted on a computer cluster with 12 Intel Xeon X5650 
CPUs running Windows HPC Server 2008. The mass 
and length limits are input parameters of the program. 

Implementation 

Peptide compositions 

Any peptide composition is represented by a numerical 
vector («!, n 2 ,—, n 2 o), whose i-th component is equal to 
the number of times the i-th amino acid occurs in the 
peptide. For example, sequence aia 2 o<Zi«i has composi- 
tion (3, 0,..., 0, 1). In some cases, it is convenient to con- 
sider peptides as sequences composed of less or more 
than 20 letters (tryptic peptides without missed 
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cleavages, post-translationaly modified peptides, etc.). 
For this reason, let us adopt a more general notation: 
assume we have an alphabet of N characters and com- 
position vectors n 2 ,—, n N ). The length of a composi- 
tion is defined as L = n\ + n 2 +■■■ + n N . If w,- is the 
monoisotopic mass associated with the i-th letter, then 
the monoisotopic mass of a composition is defined as m 
= Hiffli + K2IM2 +... + n N m N (the monoisotopic mass of 
H 2 0 and a proton may be added to this mass if 
necessary.) 

For a single sequence of letters we have a uniquely 
defined composition, while for a single composition of 
length L we have 



n\ \n 2 \ ... hn! 

corresponding sequences, given by the multinomial 
coefficient. Note that all these sequences will have the 
same mass, which explains the convenience of enumer- 
ating peptide compositions instead of peptide sequences 
in order to obtain all theoretically possible peptide 
masses. 

The number of compositions of length L is equal to 
the number of ways to choose L elements from a set of 
N elements if repetitions are allowed: 

[N + L-l\ (N + L-l)! 
V L J L!(N- 1)! ' 

The number of compositions of all lengths not greater 
than L (including one composition of length 0) is equal 
to 

x^/N + fe-l\ (N + L\ 
which follows from the equation 




The latter is based on the recurrence relation 




and can be found in the book by Graham and others 
[15]. The number of sequences of all lengths not greater 
than L is equal to 

N(N L - 1) 
N — 1 

Table 1 shows the number of compositions and 
sequences for peptides comprised of 20 amino acids. 
Note how the number of sequences exceeds the number 
of compositions as the length of peptides grows. 

Enumerating peptide compositions 

Figure 1 shows the pseudocode of a basic recursive pro- 
cedure for enumerating all compositions of length not 
greater than L for an alphabet of N letters. Array c 
holds current composition n 2 ,—, n N ) and is indexed 
from one. Procedure Mass returns the mass of the input 
composition c. For a given length L, procedure GenBasic 
should be called with parameters (L, start = 1). Note 
that the depth of recursion for this procedure is equal 
to N - 1. The number of compositions enumerated by 
this procedure is given by equation (3). 

Procedure GenBasic begins enumeration with compo- 
sition (0, 0,..., 0) and first generate all compositions with 
n N ranging from 0 to L. It then sets n N _i to 1, and gen- 
erates all compositions with n N ranging from 0 to L - 1, 
and so on. The last composition in this generation pro- 
cess is (L, 0,..., 0). Essentially, the compositions are gen- 
erated like AT-digit numbers, in ascending order, with 
requirement that the sum of the "digits" must not be 
greater than L. For instance, for AT = 3 and L = 2 the 
procedure generates all compositions up to length 2 in 
the following order: (0, 0, 0), (0, 0, 1), (0, 0, 2), (0, 1, 0), 
(0, 1, 1), (0, 2, 0), (1, 0, 0), (1, 0, 1), (1, 1, 0), (2, 0, 0). 

Several changes to procedure GenBasic will make it 
faster. First, if L is equal to zero on line 3 then there is 



Table 1 Number of compositions and sequences comprised of 20 letters, of length not greater than L, for L ranging 
from 3 to 10, and their ratios (rounded) 



Length of Peptides (/.) 


Number of Compositions [A) 


Number of Sequences (B) 


Ratio B/A 


3 


1,770 


8,420 


5 


4 


10,625 


1 68,420 


16 


5 


53,129 


3,368,420 


63 


6 


230,229 


67,368,420 


293 


7 


888,029 


1,347,368,420 


1,517 


8 


3,108,104 


26,947,368,420 


8,670 


9 


10,015,004 


538,947,368,420 


53,814 


10 


30,045,014 


1 0,778,947,368,420 


358,760 
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next call of the generating procedure. By doing this, we 
avoid the need to recalculate the mass of the part of the 
composition that has not been changed. 

Figure 2 shows the pseudocode of procedure Gen which 
is a faster version of procedure GenBasic. It generates a 
histogram of peptide compositions' masses, instead of 
printing them, which is more suitable for its further use. 
The histogram, stored in global array massHist, contains 
the number of compositions falling into the mass bins of 
width 0.001 Da. Procedure Round returns the rounded 
integer value of its argument. Note that since procedure 
Gen calculates the mass of compositions "on the fly", we 
do not need to store compositions in array c, so lines 10 
and 16 may be removed. We assume that array aam of 
size N stores masses m lt ra 2 ,..., m N . Procedure Gen should 
be called with parameters {L, start = 1, m 0 = 0). 

Enumerating peptide compositions in parallel 

The task of enumerating all compositions («i, «2>---, n N ) 
can be split into smaller independent subtasks or jobs 
that can be executed in parallel. Indeed, a single call to 
procedure Gen with parameters (L, 1, 0) is equivalent to 
L+l calls with parameters (L, 2, 0), (L - 1, 2, aam[l]),..., 
(0, 2, aam[l]*L), while «i is set to 0, 1,..., L, 



1 procedure Gen(Z, start, mo) 

2 comment: global array massHist has been initialized with zero values 

3 if start < N then 

4 if L = 0 then 

5 £^Round(m 0 *1000) 

6 massHist[k] <— massHist[k] + 1 

7 else 

8 m <— mo — aam[start] 

9 for i <- 0 to L 

10 c[start] <— i 

11 m <— m + aam[start] 

12 Gen(X - i, start + 1 , m) 

13 else 

14 m <— mo - aam[start] 

15 for i <- 0 to Z 

16 c[N] ^ i 

17 m <— m + aam[N] 

18 £^Round(m*1000) 

19 massHist[k] <— massHist[k] + 1 

Figure 2 Pseudocode for recursive procedure Gen, a faster version of GenBasic The procedure generates the mass histogram of all 
compositions of length not greater than L 



1 procedure GenBasic(L, start) 

2 if start < N then 

3 for i <— 0 to L 

4 c[start] <— i 

5 GenBasic(X - i, start +1) 

6 else 

7 for ;<-0 to X 

8 c[N] ^ i 

9 m <— Mass(c) 
10 print m 

Figure 1 Pseudocode for recursive procedure GenBasic which 
enumerates all compositions of length not greater than L and 
prints their masses 



no need to make assignment on line 4 and call GenBasic 
on line 5, since it is already known that the rest of the 
composition will contain zeros only. Second, we can cal- 
culate the mass of a composition as soon as its compo- 
nent «; becomes known, and then pass this mass to the 
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correspondingly (Figure 3). As before, we assume that 
array aam stores masses mi, m 2 ,..., m N of the used 
amino acids. Certainly, we will have to combine mass 
histograms produced by each call of procedure Gen, 
which can be done knowing parameters of each job, 
described by a triplet (L, start, m 0 ). 

To illustrate this idea, consider again our example 
with N = 3 and L = 1. The primary task is to enumerate 
the following compositions: (0, 0, 0), (0, 0, 1), (0, 0, 2), 
(0, 1, 0), (0, 1, 1), (0, 2, 0), (1, 0, 0), (1, 0, 1), (1, 1, 0), (2, 
0, 0). This can be accomplished by independent enu- 
meration of three subsets of compositions: (i) (0, 0, 0), 
(0, 0, 1), (0, 0, 2), (0, 1, 0), (0, 1, 1), (0, 2, 0); (ii) (1, 0, 
0), (1, 0, 1), (1, 1, 0); and (iii) (2, 0, 0). Compositions (i) 
can be enumerated by setting « x = 0 and calling Gen 
with parameters (L = 2, start = 1, m 0 = 0); compositions 
(ii) can be enumerated by setting «j = 1 and calling Gen 
with parameters (L = 1, start = 2, m 0 = mi); and single 
composition (iii) is enumerated by setting n\ = 2 and 
calling Gen with parameters (L = 0, start = 2, m 0 = 
2m x ). 

How can we create a list or table of jobs given the 
initial job described by parameters (L, 1, 0)? First, job 
(L, 1, 0) is replaced by L+l jobs (L, 2, 0), (I - 1, 2, aam 
[1]),..., (0, 2, aam[l]*L) (Figure 3). If, for a given L, job 
(L, 2, 0) is executed in acceptable time, we do not need 
to do anything else, and the table of jobs has been initi- 
alized. Otherwise, we can split job {L, 2, 0) into L+l 
jobs with start = 3, and similarly split other jobs with 
start = 2. Thus, for all jobs with start = 2 there is cer- 
tain i max ,2 such that if the first parameter of the job is 
larger than i max , 2 then this job should be split into jobs 
with start = 3. When this is done, we move to the jobs 
with start = 3 and process them in a similar manner: all 
jobs that have first parameter larger than I max ,3 should 
be split into jobs with start = 4. We continue this until 
each job in the job table can be executed in acceptable 
time (see additional notes on this in the Discussion 
section). 



When the table of jobs has been initialized, the jobs 
from the table can be assigned to computation pro- 
cesses. In this context, it is convenient to think about a 
master process, which does these assignments (Figure 
4), and worker processes (Figure 5), which execute the 
assigned jobs and return results back to the master. The 
master then combines partial mass histograms com- 
puted by workers into a single final mass histogram. 
There may be different strategies utilized in assigning 
the jobs. For example, larger jobs (with larger L) may be 
assigned prior to smaller jobs (with smaller L). In our 
experiments, which are presented in the Discussion sec- 
tion, there were no particular strategy in job assign- 
ments (jobs were assigned in the order in which they 
had been inserted into the job table). 

The data exchange between the master and workers 
(Figure 4, lines 12, 16; Figure 5, lines 2, 5, 6) can be 
organized by using functions MPI Send and MPI Re- 
ceive from any library implementing MPI [16]. In our 
implementation, we used Microsoft Visual C++ and 
MPI library from Microsoft HPC SDK Pack. 

Results and Discussion 

It is worthwhile to make several additional comments 
on procedure Gen, presented on Figure 2. Various prac- 
tical considerations may suggest using an upper limit on 
the mass of peptide compositions that one wants to 
enumerate. In this case, a significant improvement in 
computation speed may be achieved by canceling the 
enumeration of compositions whose mass exceeds a 
given limit. If array aam contains mass values in ascend- 
ing order, we can return from function Gen as soon as 
the current mass (m 0 in line 5, m in lines 11 and 17) 
exceeds the threshold. To illustrate a possible gain in 
speed that may be achieved by using a maximum mass 
limit, consider enumeration of compositions corre- 
sponding to all tryptic peptides up to the length of 30. 
It takes 1 hour 20 minutes to complete the full enu- 
meration of such compositions, while with the mass 



start = 1 



start 



start ■ 



Gen(L, 1, 0) -»• m 


= 0, Gen(L, 2, 0) -> 


«i 


= 0, n 2 = 


0, Gen(L, 3, 0) 


«i 


= 1, Gen(L -1,2, aam[\]) 


Ml 


= 0, M 2 = 


1, Gen(L -1,3, aam[2]) 


«i 


= 2, Gen(L - 2, 2, aam[l]*2) 


H\ 


= 0, n 2 = 


2, Gen(L-2,3,aam[2]*2) 


«i 


= L, Gen(0, 2, aam[l]*L) 


Ml 


= 0,n 2 = 


L, Gen(0, 3, aam[2]*L) 



Figure 3 Creating multiple jobs from a single job of enumerating all compositions A single call to procedure Gen with parameters (L, 
0) is equivalent to L+] calls with parameters (L, 2, 0), (L - 1, 2, aam [1 ]),.., (0, 2, aam [1]*/.), while n, is set to 0, 1,.., L, correspondingly. Any job 
with start = 2 can be further expanded into L+] jobs with start = 3, as shown for Gen(/_, 2, 0). 
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1 procedure CreateMassHistMaster(X) 

2 create numOfWorkers work processes 

3 jobs <— CreateJobs(Z) 

4 numOfJobs <— number of created jobs 

5 numOfBusyWorkers <— 0 
6 

7 while numOfJobs > 0 or numOfBusyWorkers > 0 

8 comment: find free process and assign next job 

9 p^l 

10 while/? < numOfProcs and numOfJobs > 0 

11 if process /? is free then 

12 assign next unassigned job from y'o&s to process p 

13 numOfJobs <— numOfJobs - 1 

14 numOfBusyWorkers <— numOfBusyWorkers + 1 

15 /?^/?+l 

1 6 wait for massHist from any worker 

17 update massHistGlobal using massHist 

18 numOfBusyWorkers <— numOfBusyWorkers — 1 
19 

20 tell work processes to terminate 

21 return massHistGlobal 



Figure 4 Pseudocode for procedure CreateMassHistMaster 



limit of 3 kDa (heavier peptides are rarely identified in 
MS experiments) it takes only 11 minutes, as about 87% 
of the compositions can be skipped (Tables 2 and 3). 

There may be other modifications to this procedure, 
depending on the intended use of the generated mass 
distribution. For example, the maximum number of 
occurrences of each amino acid in a peptide may be 
made limited by a threshold based on the amino acid 
and the length and/or mass of the peptide. This would 



make the generated mass distribution more realistic and 
may increase the lengths of forbidden zones [14]. 
Instead of counting the number of peptide composi- 
tions, one can count the number of peptide sequences 
using equation (1). In this case, efficient computation of 
factorials "on the fly" can be implemented similar to the 
computation of peptide masses. If we are interested in 
enzyme-specific peptides, the procedure can be modified 
to allow a given number of missed cleavages. The 



1 procedure CreateMassHistWorker(X) 

2 wait for job from the master 

3 while job is not to terminate 

4 massHist <— Gen(job.L, job. start, job.m) 

5 send massHist to the master 

6 wait for job from the master 
7 

8 terminate 



Figure 5 Pseudocode for procedure CreateMassHistWorker 
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Table 2 Computation times for enumerating all tryptic compositions up to the length of 30, for different sets of jobs 
and number of work processes, with and without the maximum mass limit 

Task Number Job Table Computation Time 

of Workers 





Number 
of Jobs 


start 


^-max,2 


^-max,3 


^-max,4 


massMax = 3 kDa 


no massMax 


1 


1 


1 








6 h 03 min 


35 h 11 min 


5 


30 


2 








2 h 12 min 


14 h 52 min 


30 


30 


2 








1 h 39 min 


13 h 32 min 


30 


255 


< 3 


20 






28 min 


5 h 02 min 


71 


255 


< 3 


20 






27 min 


4 h 57 min 


71 


679 


< 5 


20 


24 


28 


1 1 min 


1 h 20 min 



Computations were done using 71 work processes executed on a cluster with 12 Intel Xeon X5650 CPUs running Windows HPC Server 2008. 



number of amino acids (N) and their monoisotopic 
masses may vary depending on specific proteases used 
in sample preparation, possible post-translational or 
chemical modifications, and other factors. The resolu- 
tion of the mass histogram (0.001 Da) may be changed 
as well, without significantly impairing computational 
speed. 

An important question is how the job [L, start + 1, 0) 
compares with the job (L, start, 0) in terms of computa- 
tional complexity. Let us denote the number of compo- 
sitions enumerated by the first procedure by C(L, start 
+ 1), and the number of compositions enumerated by 
the second procedure by C(L, start). Using equation (3) 
we have: 

„, T . f N - start + 1 +L\ 
C(I, start) = ^ N _ stan+l J. 

Hence, 

C(L, start) L 
C[L, start + 1 ) N — start + 1 



Table 3 Computation times for enumerating all tryptic 
compositions with different maximum lengths, with and 
without maximum mass limit 



L 


Computation Time 




maxMass = 3 kDa 


no mass limit 


25 


19 min 


29 min 


30 


1 1 min 


1 h 20 min 


35 


8 min 


5 h 38 min 


40 


8 min 


38 h 28 min 


45 


14 min 


> 96 h 


50 


29 min 





Parameters of the job table were: start < 7, L maXi2 - 20, /. max ,3 = 24, L max4 = 
28, /. maXi 5 = 34, /.max,6 = 40. Computations were done using 71 work processes 
executed on a cluster with 12 Intel Xeon X5650 CPUs running Windows HPC 
Server 2008. 



For example, if N = 20, L = 40, and start = 1, then C 
(40, 1)/C(40, 2) = 3, which means that we get a three- 
fold decrease in computation time by replacing one call 
Gen(40, 1, 0) by 41 calls to Gen with start = 2, executed 
in parallel. Similarly, we get 

C(L, start) N — start + 1 

C(L — 1, start) L 

Thus, if N = 20, L = 40, and start = 2, then C(40, 2)IC 
(39, 2) « 1.5, which means that Gen(39, 2, 0) will be 
about 1.5 times faster than Gen(40, 2, 0). 

Initialization of a job table requires the maximum 
value of parameter start, as well as parameters i max ,2 > 
I maii 3 , etc., to be specified. These can be determined 
empirically based on the available computational 
resources and the number of processes that can be 
executed in parallel. For example, we found that for 
enumerating tryptic peptide compositions of masses up 
to 3 kDa by using 72 processes running on 12 Intel 
Xeon X5650 CPUs the following parameters would 
give good performance: start < 7, i max> 2 = 20, L max ,3 = 
24, I max>4 = 28, L max>5 = 34, I max>6 = 40. The tuning of 
these parameters is important to ensure good perfor- 
mance, as they directly affect the computation time 
(Table 2). 

It should be noted that a job table may have jobs with 
the same parameters L and start, differing only in M. 
For example, consider the case illustrated in Figure 3. 
Splitting job [L, 2, 0) into L+l jobs with start = 3 will 
give us, among others, job (£-1, 3, aam[2]). On the 
other hand, splitting job {L-l, 2, aam[l]) into L jobs 
with start = 3 gives us job (L-l, 3, aam[V\). It is clear 
that execution of these two jobs can be done in one call 
to function Gen, which should be modified to be able to 
accept two input masses mi, instead of mo, and to 
work with two variables m , m 2 instead of m. In a simi- 
lar manner, execution of more than two jobs may be 
done in one call to function Gen. This approach will 
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lead to a significant speed-up in computations (it has 
not been implemented in our code). 

In fact, a job table may have jobs with all three para- 
meters L, start and M being equal. Consider, for exam- 
ple, a primary job with L = 40, start = 0, and m 0 = 0. 
Assume that array aam holds amino acid masses in 
ascending order. Then the first five masses stored in 
this array will correspond to glycine (G), alanine (A), 
serine (S), proline (P) and valine (V), and we can denote 
the first five elements of a composition by n G , n A , n s , 
n P , ny. Assume that the job splitting algorithm (see sub- 
section 2.3) yields the following two jobs: 

r\Q = 2, Ha = 0, ng = 0, n P = 0, n v = I, start = 6, L = 37, 
tic = 0,Ua = 3, n$ = 0, rip = 0, n v = 0, start = 6, L = 37. 

Then these two jobs will have the same m 0 = 213.111 
Da, since tripeptides GGV and AAA are isomeric. If a 
job table is generated using parameters start < 7, L max ,2 
= 20, i max ,3 = 24, i maXi 4 = 28, i ma x,5 = 34, L maXi 6 = 40, 
then for L = 40 about 2% of all jobs will be duplicates; 
for L = 50 - about 29%, and for L = 60 - about 47%. In 
the case when we are only interested in the mass distri- 
bution of peptide compositions, there is no need to exe- 
cute duplicate jobs. If certain job occurs k times, it is 
enough to execute it once and then multiply the result- 
ing histogram by k before adding it to the final histo- 
gram. However, if we would like to get every peptide 
composition, then we cannot remove duplicate jobs. 

In the end of this section, we present Table 3 which 
shows computation times for enumeration of tryptic 
compositions for a range of lengths between 25 and 55, 
with and without the use of a maximum mass limit. The 
numbers in the second column may seem counterintui- 
tive at first, since, for example, it takes 19 min to gener- 
ate the distribution for L = 25 and 11 min for L = 35. 
The explanation, however, lies in using the maximum 
mass limit of 3 kDa. The longest job for the task with L 
= 25 was L = 24, start = 2, m 0 = 0, and it executed for 
19 min. The longest job for the task with L = 30 was L 
= 24, start = 2, mo = 285, and it executed for 8 min. 
The difference in 11 min comes from the fact that more 
compositions were canceled out in the second case 
because of the mass limit that was used. 

We would like to note, in addition to Table 3, that 
enumeration of all tryptic peptides having the mass no 
greater than 3 kDa (the length of these peptides does 
not exceed 51) took 32 minutes. 

Conclusions 

In this paper, we presented a detailed description of a 
parallel method for enumerating all theoretically possi- 
ble amino acid compositions and discussed different 
aspects of its implementation. Enumeration of all amino 



acid compositions is important in several proteomics 
workflows, including peptide mass fingerprinting, mass 
defect labeling, mass defect filtering, and de novo pep- 
tide sequencing. Given the fact that multi-core compu- 
ters and computer clusters are becoming increasingly 
available, it is natural to address this computationally 
expensive task using a parallelization approach. 

We believe that by reducing computational times from 
hours to minutes, the applicability of the enumeration of 
all amino acid compositions in various proteomics stu- 
dies may be significantly improved and extended. We 
have used the method described in this work to charac- 
terize forbidden and quiet zones in the mass distribution 
of tryptic peptides [14]. In the next step, we plan to 
apply this method to enhance the accuracy of protein 
identification in real mass spectrometry data. Our 
method has been implemented as a 32- and 64-bit Win- 
dows application using Microsoft Visual C++ and MPI. 
It is freely available for download at https://ispace.utmb. 
edu/users/ rgsadygo/Proteomics/ParallelMethod. 

Availability and Requirements 

* Project Name: PepComp 

• Project home page: https://ispace.utmb.edu/users/ 
rgsadygo/Proteomics/ParallelMethod 

♦ Operating System: MS Windows 

* Other Requirements: Message Passing Interface, 
multi-core CPU 

• Programming Language: Visual Studio C++ 

♦ License: No license needed 
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